COVID-19 Vaccine Demographics Analysis

By: Hailey Dusablon and Jordan Stein

1. Introduction

Project Description

As partners (Hailey Dusablon and Jordan Stein), we are interested in exploring socioeconomic factors that may be correlated with health outcomes, specifically COVID-19 vaccination rates and the factors that have the greatest impact on them. Since we are both interested in careers in the healthcare field, analyzing this type of data is a great opportunity to gain insight into how social and lifestyle factors may affect the health of populations. This type of research is important because lifestyle choices can be changed, so it is useful to know how our everyday actions can impact our health. The first dataset we are interested in analyzing is the COVID Data Tracker published by the CDC (which can be found here: https://covid.cdc.gov/covid-data-tracker/#datatracker-home). This dataset contains information about the prevalence of COVID cases, deaths, and trends at the national, state, and county levels in the United States. More specifically, it contains the state-by-state distribution of people who have received the COVID vaccine. One question that would be interesting to explore with this data is how socioeconomic factors, such as poverty and minority group distribution, affect the spread of COVID and the vaccination rates in different states and counties. To do this, we will use the social determinants of health categories laid out by the U.S. Department of Health and Human Services (https://health.gov/healthypeople/objectives-and-data/social-determinants-health) to choose factors to research as potential variables that affect vaccination rates in a state. We would also like to see if the regions with lower COVID vaccination rates also have lower rates of other common vaccines. Since there is a significant stigma surrounding the COVID vaccine, it would be interesting to see how COVID vaccine rates compare to other vaccine rates.
We are considering this research because COVID has been a very relevant problem for the past year and a half and the COVID vaccine has been very controversial. We have not seen a lot of COVID research studying social factors such as religion, access to healthcare resources, and income.

Collaboration Plan:

To begin the project, we set up a GitHub repository and ensured that we were both listed as collaborators. Since the repository was created with Hailey’s account, Jordan cloned it to her computer. After this was established, we met on Zoom to discuss datasets we were interested in exploring and possible ideas about how we would analyze them. For future collaboration, we are planning to meet twice a week on Tuesdays and Thursdays for two hours each day via Zoom or in person, and we will coordinate code through a private GitHub repository.

Challenges with Obtaining Data:

The biggest challenge we had was choosing which factors to research to achieve our goals. By choosing to focus on the social determinants of health, we were able to narrow down what data we were looking for and then find relevant data. Many of the data sets that we found were so extensive that they were too big for Pandas to handle without crashing, so some had to be trimmed down in Microsoft Excel before they were usable. We also had difficulty deciding how to tidy and reformat the data when creating the data frame. We decided it would be best to drop the columns that we would not need for our research purposes. Lastly, we dropped all the columns about the booster vaccine since the data was only available at a national level (because the booster vaccine was only developed very recently). Another challenge in starting this project was learning how to navigate GitHub. It took a bit of trial and error to figure out how to organize our information on the site. Now, we are more familiar and comfortable with using GitHub.

Project Goals:

A CDC analysis found that "counties with high social vulnerability had lower vaccination rates than counties with low social vulnerability." More information about this research can be found here: https://www.kff.org/coronavirus-covid-19/issue-brief/vaccination-is-local-covid-19-vaccination-rates-vary-by-county-and-key-characteristics/

The goal of this project is to use Data Science techniques to examine data sets about COVID vaccination coverage in the US and see how this data relates to socioeconomic status. The motivation for this investigation is to determine any significant correlations between various socioeconomic factors and COVID vaccination rates. The same source listed above states that "Ensuring access to COVID-19 vaccines for these communities can help address the disparate health effects of the virus and achieve herd immunity." This type of research can be helpful in identifying specific factors that may affect the decision-making processes for vaccine distribution.

Using available data, we will examine the social determinants of health one by one as we examine our data:

Healthy People 2030, U.S. Department of Health and Human Services, Office of Disease Prevention and Health Promotion. Retrieved from https://health.gov/healthypeople/objectives-and-data/social-determinants-health

In this analysis, the COVID vaccine rates for each location will be the dependent variables, and various social factors will serve as independent variables.

Questions Being Investigated:

  1. How do the features from the Social Determinants of Health model, such as poverty, income, and education affect the rate of vaccination at the state level in the United States?
  2. Can we use the answers to these questions to make a model that can predict vaccination coverage at the state and county levels?

Links:

https://github.com/hduece/finaltutorial

https://hduece.github.io/finaltutorial/

References

  1. https://covid.cdc.gov/covid-data-tracker/#datatracker-home : State-level data for COVID Vaccines (vac_state)
  2. https://worldpopulationreview.com/state-rankings/median-household-income-by-state : Median Household Income by State for 2021 (state_income_df)
  3. https://worldpopulationreview.com/state-rankings/public-school-rankings-by-state : School Rating Data (school_rank)
  4. https://data.cdc.gov/Vaccinations/Vaccination-Coverage-among-Adults-18-Years-/aetd-68ew : Vaccine Coverage Data (vaccine_cov)
  5. https://corgis-edu.github.io/corgis/csv/state_demographics/ : State Demographics (Education, Employment, Ethnicity)(state_demo_df)
  6. https://worldpopulationreview.com/state-rankings/unemployment-rate-by-state : Unemployment Data (unemployment_df)
  7. https://www.kff.org/other/state-indicator/beds-by-ownership/?currentTimeframe=0&sortModel=%7B%22colId%22:%22Location%22,%22sort%22:%22asc%22%7D : Hospital Beds Per 1000 People (beds_df)
  8. https://worldpopulationreview.com/state-rankings/most-republican-states : States' political affiliation according to the Cook Partisan Voting Index. 16 was added to each value to make them all positive. (politics_df)
  9. https://worldpopulationreview.com/state-rankings/safest-states : States' safety score according to WalletHub research, based on 53 safety indicators. (safety_df)
  10. https://www.usgovernmentspending.com/compare_state_spending_2019b60an : State spending on public transportation. Percentage found by dividing State Spending by Gross State Product and multiplying by 100.
  11. https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-County/8xkx-amqh : County Level Vaccination data (county_vacc_all)

2. Extract, Transform, and Load Data

Part 1: Obtaining Data for State Level Demographics and COVID Vaccine Statistics

We will begin by loading in our CSV files and turning them into one coherent data frame. Since so many factors are being considered, it was impossible to find a single data frame from a single source. Thus, we have imported several CSVs from different sources to create a new data frame containing all of the factors we will be exploring.
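As a rough sketch of what combining several CSVs into one data frame can look like, the example below joins two small tables on the state name. The column names (`State`, `MedianIncome`, `PctOneDose`) and all values are made up for illustration, and the in-memory CSV text stands in for the real files; only the data frame names `state_income_df` and `vac_state` come from our reference list.

```python
import io
import pandas as pd

# Made-up stand-ins for two of the source CSVs (hypothetical columns and values).
income_csv = io.StringIO("State,MedianIncome\nVermont,60000\nTexas,62000\n")
vaccine_csv = io.StringIO("State,PctOneDose\nTexas,65.0\nVermont,88.0\n")

state_income_df = pd.read_csv(income_csv)
vac_state = pd.read_csv(vaccine_csv)

# An inner join on the state name keeps only states present in both tables.
states_df = vac_state.merge(state_income_df, on="State", how="inner")

# Index by state so each row is one state and every column is one factor.
states_df = states_df.set_index("State").sort_index()
print(states_df)
```

Repeating the merge for each additional source adds one more column of factors per state, as long as the state names are spelled the same way in every file.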

Our final, cleaned data frame contains socioeconomic information for each state. All of the data was standardized to ensure consistent scales. There is no missing data and all data types are numerical.
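One common way to put columns on a consistent scale is the z-score, which subtracts the column mean and divides by the standard deviation. The sketch below assumes that form of standardization, which may differ from the exact method used in our notebook, and the income values are made up.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up median incomes for three states, for illustration only.
incomes = [50000, 60000, 70000]
z_incomes = standardize(incomes)
print(z_incomes)
```

After this step every column has mean 0 and standard deviation 1, so a factor measured in dollars and a factor measured in percent contribute on comparable scales.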

Part 2: Exploratory Data Analysis and Data Visualization

The map above illustrates the percent of each state's population that has received at least one dose of the COVID vaccine. Our assumption is that people who have received at least one dose will be more likely to become fully vaccinated in time. We will be using several predictive variables to try to recreate this map without using the actual vaccination data. This data is from our vac_state dataframe, which we downloaded from the CDC.

The purpose of the heatmap above is to address any confounding variables. Some variables, such as income and poverty, and education and income, are bound to be highly correlated (positively or negatively). While we will still be exploring some of these factors individually, it is important to keep in mind that many of these variables are not independent of each other.

Determinant 1: Education Access and Quality

We will begin by looking at education and comparing different education factors of a state to the percent of the state's population that has at least one dose of the vaccine. In this section, we will calculate the correlation between Percent of State Populations with High School Educations, College Educations, Public School Rankings, and Vaccine Rates.
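For reference, the correlation values reported in this tutorial are Pearson correlation coefficients (the same quantity that pandas' `DataFrame.corr` computes by default). Below is a minimal sketch of that calculation in plain Python; the function name and the four-state values are made up for illustration and are not our actual data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values for four states, for illustration only.
pct_bachelors = [22.0, 28.0, 33.0, 41.0]
pct_one_dose = [58.0, 66.0, 70.0, 85.0]
school_rank = [40, 25, 12, 1]  # 1 is the best-ranked state

r_college = pearson_r(pct_bachelors, pct_one_dose)
r_rank = pearson_r(school_rank, pct_one_dose)
print(r_college, r_rank)
```

Note that a ranking where 1 is best produces a negative coefficient when better-ranked states have higher vaccination rates, which is why the sign of a ranking correlation has to be read with the scale in mind.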

HS Education Rates, College Education Rates, Public School Rankings

Education Analysis

There is a strong positive correlation between having a bachelor's degree or higher and the percent of the total population that has received at least one dose of a vaccine. The correlation is 0.729796, and the scatter plot shows a clear positive trend between these variables. This indicates that states with highly educated populations may be more likely to have a larger percent of their population vaccinated, even though there is not a strong correlation between high school education and vaccination rates. Public school rankings also had a fairly notable correlation of -0.478508 with vaccine rates. This relationship is negative because in our data, a better ranking has a lower number on the ranking scale.

Determinant 2: Economic Stability

We will now examine economic stability factors, including the most recent data for each state on median household income, unemployment rates, poverty rates, and individuals with disabilities. All of these are compared to the percentage of people in the state who have had at least one dose of the vaccine.

Median Household Income, Unemployment Rates, Poverty Rates, Disability Rates

Economics Analysis

As expected, median household income has a significant correlation with vaccination rates. It is interesting to note on the scatterplot that many of the southern states (Louisiana, Arkansas, Mississippi, etc.) tend to have lower median incomes, whereas several northern states (Massachusetts, Connecticut, New Hampshire) tend to have higher median incomes. This same pattern can also be seen in the college education plot from earlier. This is not surprising, since we already established that education and income were confounding variables. Persons Below Poverty Level also seems to be related to COVID vaccination rates, with a correlation of -0.443451, which means higher levels of poverty may be linked to lower vaccination rates. Neither disability rates nor unemployment rates were strongly correlated with vaccination rates.

Determinant 3: Healthcare Access and Quality

Another important health determinant is how much access people have to healthcare and whether that access is of high or low quality. This was a difficult subject to obtain data for, since there is no concrete way to measure healthcare accessibility. In this section, we will examine the percentage of people in a state without health insurance as well as how many hospital beds are available per 1000 people in each state.

Percent without Health Insurance, Hospital Beds Per 1000 People

Both the percentage of uninsured individuals and the number of hospital beds had notable correlations with vaccination rates. States with more insured people tend to have higher vaccination rates. Unexpectedly, states with fewer hospital beds had populations with higher vaccination coverage. This could mean that areas with fewer hospital beds have healthier populations, but it is also possible that the number of hospital beds is not the best way to measure access to healthcare resources.

Determinant 4: Neighborhood and Built Environment

Another important health determinant is the neighborhood and built environment context. In this category, we will look at the safety ranking of public schools in each state. We felt this metric would provide the most general measure in terms of community safety for each state, but in the future, it would also be interesting to explore crime and incarceration rates per state in relation to safety and vaccine coverage.

Public School Safety

Safety Analysis

Public school safety has a notable correlation of 0.479008 with COVID vaccination rates. Again, we see that many southern states rank lower in terms of safety, whereas many northern states rank high. This data indicates that safer communities may be linked to higher vaccination coverage.

Determinant 5: Social and Community Context

The final determinant category we will be looking at is social and community factors. These will include racial and ethnic makeup, the percentage of the population under 18 and over 65, population density, and political affiliation.

Percent Non-Racial Minority (White), Percent Under 18, Percent Over 65, Population Density, Political Affiliation

Political affiliation had a very high correlation of -0.853151 with vaccine rates. Population density and percent of population under 18 also had fairly strong correlations with our vaccine data. Surprisingly, the percent of the population over 65 was not strongly correlated, which we did not expect because older populations are more at risk for COVID.

The graph above provides a nice summary of the correlations found in our data exploration. To analyze our data more concisely, we will only consider variables with correlations greater than or equal to 0.4 or less than or equal to -0.4. Thus, the variables we will consider are political affiliation, age under 18, hospital beds, health insurance, public school rankings, poverty, safety, population density, income, and college education.
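The cutoff can be expressed as a filter on the absolute value of each correlation. In the sketch below, the first five values are the correlations quoted earlier in this tutorial; the dictionary keys and the final `weak_example` entry are hypothetical and only there to show a variable being excluded.

```python
# Correlations with vaccination rate. The first five are quoted in the text;
# the last one is a made-up placeholder for a weakly correlated variable.
correlations = {
    "college_education": 0.729796,
    "public_school_rank": -0.478508,
    "poverty": -0.443451,
    "safety": 0.479008,
    "political_affiliation": -0.853151,
    "weak_example": 0.10,
}

THRESHOLD = 0.4

# Keep a variable when its correlation is >= 0.4 or <= -0.4.
selected = {name: r for name, r in correlations.items() if abs(r) >= THRESHOLD}
print(sorted(selected))
```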

Part 3: Building The Prediction Model

Using the correlations we have found between these variables and the vaccination rates of each state, we are able to build a general model that can recreate the ranking of each state (or rank other similar entities). The model we have developed is this:

$\sum$ (Correlation * Variable) for all variables available = Vaccination Score

The higher a vaccination score, the higher a state's vaccination rate is predicted to be. Variables can be removed from the equation for a less accurate model if that variable is unavailable for the entity the model is being used for. The model works best with entities that are similar to U.S. states.
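A minimal sketch of this scoring formula is shown below. The function name, dictionary keys, and the standardized feature values for the two example states are hypothetical; the three correlation values are the ones quoted earlier in the text.

```python
def vaccination_score(features, correlations):
    """Sum of correlation * value over the variables available for this entity."""
    return sum(
        correlations[name] * value
        for name, value in features.items()
        if name in correlations
    )

correlations = {
    "college_education": 0.729796,
    "political_affiliation": -0.853151,
    "poverty": -0.443451,
}

# Hypothetical standardized values for one state with all three variables.
state_a = {"college_education": 1.0, "political_affiliation": -1.0, "poverty": -0.5}
# The same state with political affiliation unavailable: that term is skipped.
state_b = {"college_education": 1.0, "poverty": -0.5}

score_a = vaccination_score(state_a, correlations)
score_b = vaccination_score(state_b, correlations)
print(score_a, score_b)
```

Because negative correlations are multiplied by the variable's value, a state that is below average on a negatively correlated factor (such as poverty) gains score, which is the intended behavior.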

The factors that we found to most strongly correlate to vaccination status include political affiliation, percent of population with at least college education, and income. Again, we are only considering variables that we found to have significant correlations with vaccination data.

Our graph and the scores for each state fit our predictions of what states would perform best with vaccination rates, with northeastern states performing very well and southern states being at the lower end, similar to our actual graph.

Because we created this formula based on only 50 states, we were unable to build a model to predict COVID vaccination coverage for the states without overfitting. However, we calculated the Euclidean distances between our COVID scores and the literature vaccine coverage values to see which states our calculations performed best for. Larger distances correspond to less accurate predictions. It should also be noted that because we are calculating scores based on our own system and not vaccine rates, our results are not meant to match perfectly, but we expect states with higher scores to have higher vaccination rates. In the plot above, you can see that states with higher scores tend to have higher numbers of vaccinated people. Some states, such as Florida and New Mexico, are far off, so there are clearly errors in our model.
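With one score and one coverage value per state, the Euclidean distance reduces to the absolute difference between the two numbers. The sketch below assumes the scores have already been put on a scale comparable to the coverage percentages; the state labels and values are made up.

```python
from math import dist

# Made-up scores and actual coverage values (assumed to be on comparable scales).
scores = {"State A": 70.0, "State B": 55.0, "State C": 62.0}
actual = {"State A": 72.5, "State B": 68.0, "State C": 61.0}

# Distance between the one-dimensional points (score,) and (actual,).
distances = {s: dist([scores[s]], [actual[s]]) for s in scores}

# States ordered from most to least accurate prediction.
ranked = sorted(distances, key=distances.get)
print(ranked)
```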

We can also attempt to calculate the percent error of our predictive model. We cannot do this perfectly, as our predictions are not percent vaccinated but rather a score that represents how high a state's vaccination rate is relative to other states. To see how close we were, we can assume that we got one state correct and see how far off the other states are from that. Massachusetts is ranked #1 in vaccinations in both our predictions and the actual data, so we will scale our predictive scoring on the assumption that the Massachusetts percentage is correct, and then look at the percent error of the other 49 states relative to that value. Our percentage spread and our score spread are different, at a ratio of 27.7:17, or 1:0.61. To fix this, we will need to take a few steps:

1) Add the absolute value of the lowest score, plus 1, to every score. This removes all negative numbers and starts our scoring system at 1.

2) Multiply all of our scores by 0.61 so our scores are on the same scale.

3) After doing all of this, our Massachusetts predicted score is 49.6 lower than the real Massachusetts score. We then added this number to all of our values, bringing the percent error of Massachusetts to 0 so that we can examine the other errors relative to this one.

4) Now we can calculate percent error of all other states relative to how close we were with our Massachusetts variable by using the standard percent error formula: (predicted val-actual val)/actual val * 100.

Compared to how close we were with our Massachusetts estimate, our average percent error with our other estimates was 6.5%.
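The four steps above can be sketched as follows. The function names are hypothetical, the 0.61 scale factor is the one from the text, and the three-state scores and coverage values are made up, so the resulting errors are illustrative only.

```python
def rescale_scores(scores, actual, anchor, scale):
    """Shift, scale, and offset scores so the anchor state matches its actual value."""
    lowest = min(scores.values())
    # Step 1: shift every score so the lowest score becomes 1.
    shifted = {s: v - lowest + 1 for s, v in scores.items()}
    # Step 2: multiply by the scale factor so both spreads match.
    scaled = {s: v * scale for s, v in shifted.items()}
    # Step 3: add the gap between the anchor's actual and scaled values to every state.
    offset = actual[anchor] - scaled[anchor]
    return {s: v + offset for s, v in scaled.items()}

def percent_error(predicted, actual_value):
    """Step 4: standard percent error, signed."""
    return (predicted - actual_value) / actual_value * 100

# Made-up raw scores and actual coverage percentages.
scores = {"Massachusetts": 9.0, "State B": 1.0, "State C": -8.0}
actual = {"Massachusetts": 80.0, "State B": 70.0, "State C": 60.0}

predicted = rescale_scores(scores, actual, anchor="Massachusetts", scale=0.61)
errors = {s: percent_error(predicted[s], actual[s]) for s in scores}
print(errors)
```

By construction the anchor state has an error of 0, so the errors of the remaining states only measure how well the relative spacing of the scores matches the spacing of the actual rates.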

Exploring County Level Data

Now, we will use our scoring system to score each county based on the socioeconomic factors that we looked at for the states. We will then attempt to see if we can create a model that will predict vaccine rates based on our COVID scores. Some of the variables we examined at the state level were unavailable or unfeasible at the county level, so this is based on the variables we did have access to at this level.

Now we have county data that we can analyze, yay! It should be noted that the most recent county level data that we could find for a lot of the features we wanted to analyze was from 2017. However, the COVID vaccination data is relatively new as of 2019-2021, so this will undoubtedly result in some error in our analysis.

The map above visualizes how each county ranked on our scoring system. The higher the score, the higher the expected vaccine coverage for a county.

County Level Prediction Model

The purpose of this model is to see how accurately our formula (adapted to the county-level data available to us) performs: we use the formula to score each county based on its socioeconomic status, and then see how well these scores can predict vaccination rates.
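As one simple illustration of predicting rates from scores, the sketch below fits a least-squares line through (score, vaccination rate) pairs. A straight-line fit is an assumption made here for illustration and may not be the model used in our notebook; the county scores and rates are made up.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up county scores and actual vaccination rates.
county_scores = [1.0, 2.0, 3.0, 4.0]
county_rates = [42.0, 48.0, 50.0, 60.0]

slope, intercept = fit_line(county_scores, county_rates)
predicted_rates = [slope * x + intercept for x in county_scores]
print(slope, intercept, predicted_rates)
```

In practice the line would be fit on one subset of counties and evaluated on the rest, so that the reported error reflects counties the model has not seen.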

The bar plot provides a visual representation of the predicted vs. the actual vaccination rates. Clearly, there are some large margins of error in our predictions, but as mentioned previously, there were fewer correlational factors to consider at the county level since less data was available.

The map above visualizes the vaccination rate that our model predicted for each available county, based on the score we calculated for that county. In essence, we attempted to build a model that could predict vaccination rates from the scoring system we created with the state-level data. Some counties are missing because our county-level data frame for vaccination rates did not have a state column. Since our data was based on the average vaccine rate for each county name, we dropped duplicate county names to avoid averaging across counties that share a name but are in different states.
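The duplicate-name problem can be illustrated with a small sketch: any county name that appears more than once is ambiguous without a state column, so it is dropped. The county names and rates below are made up, and the variable names are hypothetical.

```python
from collections import Counter

# Made-up (county name, vaccination rate) rows with no state column.
county_rates = [
    ("Washington County", 55.0),
    ("Orange County", 61.0),
    ("Washington County", 48.0),
    ("Chittenden County", 80.0),
]

# Count how many rows share each county name.
name_counts = Counter(name for name, _ in county_rates)

# Keep only names that appear exactly once, since repeated names are ambiguous.
unique_counties = {
    name: rate for name, rate in county_rates if name_counts[name] == 1
}
print(unique_counties)
```

If a unique county identifier (such as a FIPS code) or a state column is available, joining on that instead would avoid having to drop these counties.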

Final Conclusions

While our prediction model did not perform particularly well in estimating vaccination rates for the states and counties, it was interesting to see how our state-level scoring system could be applied at the county level. It is interesting to see which factors have the largest correlations with vaccination rates and which states these findings hold true for. The factor we observed to have the strongest correlation with vaccination rates for states was political affiliation. It would be very interesting to see if this holds true in other countries across the world.

According to our predictions, the states that should have the highest vaccination rates are Massachusetts, Connecticut, Maryland, New Jersey, and Vermont. The actual states with the highest vaccination rates are Massachusetts, Vermont, Hawaii, Connecticut and Rhode Island. Our scoring system put Hawaii at number 8, Vermont at 6, and Rhode Island at 7, so even though our predicted ordering was not exactly perfect, we were rather close at predicting the general ranking of the states. The states with higher scores tend to have higher socioeconomic statuses, which is what we had initially predicted.

Recommendations for the Future

There are many places this research could be used in the future. States with lower vaccination rates can be prioritized in distribution campaigns, and if they are estimated to have a low vaccination rate because of specific variables, we can use that to take a more targeted approach. It would also be interesting to look at correlation numbers for other parts of the world and see if similar factors have a similar amount of impact on vaccination. We could also apply our prediction formula for the US to these other parts of the world and see how vaccination rates in other countries would look if they were affected by these factors in the same way that the USA is. Machine learning could even be incorporated to see if factors that humans could not recognize could be discovered. There are a lot of weaknesses in this project due to the lack of available data and the inability to analyze every factor, but it was interesting to visualize our hypotheses and put them to the test!